Automatic Acquisition of Digitized Newspapers via Internet
Abstract
Following our earlier work on modelling a database of newspapers and designing a retrieval language suited to it, we are now developing an application that automatically acquires, summarizes and stores newspaper documents published on distinct web resources. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement gatherers we have combined a context-free grammar with web traversal techniques, which are available in most current Prolog systems (e.g. SICStus with the PiLLoW library).
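The gatherer idea in the abstract can be illustrated with a minimal sketch. It is written in Python rather than Prolog and is not the authors' implementation. It assumes a toy grammar in which an article is an `h1` followed by one or more `p` elements and an index page is a sequence of `a` links. All names (`TagTokenizer`, `recognise`, `gather`) and the `fetch` callable, which stands in for real HTTP retrieval, are hypothetical.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Flatten a page into (tag, text, href) tokens.

    Simplifying assumption: the tags of interest are not nested.
    """

    KEEP = {"h1", "p", "a"}

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._open = None   # tag currently being collected
        self._buf = []      # text pieces seen inside that tag
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self._open = tag
            self._buf = []
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.tokens.append((tag, "".join(self._buf).strip(), self._href))
            self._open = None


def parse_article(tokens):
    # Toy rule: article -> h1 p+
    if len(tokens) >= 2 and tokens[0][0] == "h1" and all(t[0] == "p" for t in tokens[1:]):
        return {"type": "article",
                "headline": tokens[0][1],
                "body": [t[1] for t in tokens[1:]]}
    return None


def parse_index(tokens):
    # Toy rule: index -> a+
    if tokens and all(t[0] == "a" for t in tokens):
        return {"type": "index", "links": [t[2] for t in tokens if t[2]]}
    return None


def recognise(html):
    """Return the first grammar rule that matches the page, else 'unknown'."""
    tokenizer = TagTokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    for rule in (parse_article, parse_index):
        result = rule(tokenizer.tokens)
        if result:
            return result
    return {"type": "unknown"}


def gather(url, fetch, seen=None):
    """Traverse from url, following index links and collecting articles."""
    seen = set() if seen is None else seen
    if url in seen:
        return []
    seen.add(url)
    doc = recognise(fetch(url))
    if doc["type"] == "article":
        return [dict(doc, url=url)]
    found = []
    for link in doc.get("links", []):
        found.extend(gather(link, fetch, seen))
    return found
```

Passing `fetch` as a parameter keeps the traversal logic independent of the network, so the same sketch can be run against an in-memory dictionary of pages or against a real HTTP client.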
Related resources
A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers
Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we desc...
Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design
Most tools for accessing digitized historical newspapers emphasize relatively simple search; but, as increasing numbers of digitized historical newspapers and other historical resources become available, we can consider much richer modes of interaction with these collections. For instance, users might use exploratory search for looking at larger issues and events such as elections and campaigns...
Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres
Many historical newspapers are being digitized. We aim to support access to them via text analysis of the OCRd content. However, the OCR includes many errors; so extracting meaningful content from it is difficult. A pipeline of processing steps is proposed. Here, we describe the first two steps: segmentation and genre identification. The segmentation procedure based on headings was quite succes...
Automated Processing of Digitized Historical Newspapers beyond the Article Level: Sections and Regular Features
Millions of pages of historical newspapers have been digitized, but in most cases access to them is supported by only basic search services. We are exploring interactive services for these collections which would be useful for supporting access, including automatic categorization of articles. Such categorization is difficult because of the uneven quality of the OCR text, but there are many clu...
Pivaj: an Article-centered Platform for Digitized Newspapers Newspapers Layout
PIVAJ is a platform for archived digitized newspapers that emphasizes articles: extracting them from digitized documents by automated page layout analysis, OCRing them, and indexing their text transcription to allow users to search for content. Crowdsourcing is used to improve the quality of the indexing, by correcting the transcription and by tagging articles with keywords. The platform has been used t...